Homework 2 is due by noon, Tues 9/28. Please complete the assignment in this Markdown document, filling in your answers and R code below. I didn’t create answer and R chunk fields like I did with homework 1, but please fill in your answers and R code in the same manner as hw 1. Make sure to follow the homework guidelines when writing up this assignment (handout is located on the right side of moodle page).
Tips for using Markdown with homework sets:
Work through a problem by putting your R code into R chunks in this .Rmd. Run the R code to make sure it works, then knit the .Rmd to verify they work in that environment.
Make sure you load your data in the .Rmd and include any needed library commands.
Feel free to edit or delete questions, instructions, or code provided in this file when producing your homework solution.
For your final document, you can change the output type from html_document to word_document or pdf_document. These two to output types are better formatted for printing.
Keep the hw problems in problem order I give in this doc. You can attach hand-written work for a problem (if needed) but make it clear in this doc when you are answering a problem using work attached to your printed pdf/word doc.
Comment: Complete this problem “by hand” using the info in Display 7.9 (i.e. don’t load the data and fit a lm). Use the qt command in R to get your mutliplier \(t^*\) for the CI calculation.
Answer: As shown below, we have evidence that the effect of measured recession velocity on average measured distance is statistically significant (t = 2.073873, d.f. = 22, p = 0.0000045). We are 95% confident that a 1 km/s increase in recession velocity is associated with a 0.0008672125 to 0.0018787875 km increase in measured distance.
qt(.975,22)
## [1] 2.073873
0.001373 + c(-1,1)*qt(.975,10)*0.000227
## [1] 0.0008672125 0.0018787875
Recall the agstrat.csv data used for homework 1. This was a stratified random sample of US counties. We will consider the regression of farm acreage in 1992 (y=acres92) on farm acreage in 1982 (x=acres82).
ggplot2 package to create a scatterplot of acres92 against acres82 that includes the fitted regression line. Describe the direction, form and strenth of the relationship shown in this graph.answer: Form: Mostly linear with no visible curvature observed, Direction: Positive (as one increases the other increases as well) Strength: Strong, not much variation.
library(ggplot2)
agstrat <- read.csv("http://people.carleton.edu/~kstclair/data/agstrat.csv")
ggplot(agstrat, aes(x = acres82, y = acres92)) +
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Acres82 vs. Acres92")
acres92 on acres82. Interpret the slope in context for this problem.agstrat_lm <- lm(acres92 ~ acres82, data = agstrat)
summary(agstrat_lm)
##
## Call:
## lm(formula = acres92 ~ acres82, data = agstrat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -249540 -12517 -2286 7354 261600
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.532e+03 3.202e+03 -1.728 0.085 .
## acres82 9.844e-01 7.035e-03 139.923 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 41050 on 298 degrees of freedom
## Multiple R-squared: 0.985, Adjusted R-squared: 0.985
## F-statistic: 1.958e+04 on 1 and 298 DF, p-value: < 2.2e-16
ggplot(agstrat, aes(x = acres82, y = acres92)) +
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Acres82 vs. Acres92")
answer: \(\mu_{acres82|acres92} = -5532 + 0.9844(acres92), \hat \sigma = 41050\). According to this fitted model, for every 1 acreage increase in farm acreage in 1982 is associated with a 0.9844 increase in farm acreage in 1992.
answer: \(t-stat = \frac{0.9844 - 0}{0.007035} = 139.9289, p-value = 2P(T > 139.9289) = 2e-16\). Since we have a p-value that is significantly less than 0.05 (p-value = 2e-16). This means we have evidence that the estimated effect of farm acreage in 1982 on farm acreage in 1992 is statistically significant.
answer: Linearity: Looking at the scatter plot we can see that the mean response does indeed vary linearly with \(x\). Constant Variance: We can also see that the model errors are not associated with \(x\). Normality: From the histogram we can see that the \(\mu_{y|x}\) can be described by a normally distributed model. Independence: Since these values were measured over a decade, we can say that they are temporally (time) correlated and are therefore independent
library(ggplot2)
agstrat <- read.csv("http://people.carleton.edu/~kstclair/data/agstrat.csv")
agstrat_lm <- lm(acres92 ~ acres82, data = agstrat)
library(broom)
agstrat_aug <- augment(agstrat_lm)
head(agstrat_aug)
## # A tibble: 6 × 8
## acres92 acres82 .fitted .resid .hat .sigma .cooksd .std.resid
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 297326 319619 309102. -11776. 0.00334 41117. 0.000138 -0.287
## 2 124694 139111 131409. -6715. 0.00415 41121. 0.0000560 -0.164
## 3 246938 268434 258715. -11777. 0.00337 41117. 0.000140 -0.287
## 4 206781 197055 188450. 18331. 0.00368 41109. 0.000370 0.447
## 5 78772 89331 82406. -3634. 0.00471 41122. 0.0000186 -0.0887
## 6 210897 213105 204249. 6648. 0.00359 41121. 0.0000474 0.162
ggplot(agstrat, aes(x = acres82, y = acres92)) +
geom_point() +
geom_smooth(se = FALSE)+
labs(title = "Acres82 vs. Acres92")
library(ggResidpanel)
resid_xpanel(agstrat_lm)
hist(agstrat_aug$.resid)
plot(agstrat_lm, which = 2)
resid_panel(agstrat_lm, plots = "qq")
resid_interact(agstrat_lm, plots = "qq")
case0702 which is in the Sleuth3 package. Your R chunk should look likelibrary(Sleuth3)
head(case0702)
## Time pH
## 1 1 7.02
## 2 1 6.93
## 3 2 6.42
## 4 2 6.51
## 5 4 6.07
## 6 4 5.99
log(TIME) as the predictor in your model: lm(pH ~ log(Time), data=case0702)predict command in R. Important note: If you log a predictor in your lm command, e.g. lm(y ~ log(x)), then you give the predict command the value of x on the original (unlogged) scale when entering a value for newdata.Sleuth3 package (ex0817).ggplot2 package let’s you easily do this without applying those functions to your variables, instead you add another layer that tells R how to scale a particular axis. This method of visualization is nice because your numerical labels on the x/y axis will still be measured in the original units of the variables. If you want to, say, look at the scatterplot of sqrt(y) against log10(x) (base-10 log) you would add the layers scale_y_sqrt() and scale_x_log10() to your scatterplot of y against x. For this example, that would look like:ggplot(pest, aes(x= Load, y = Mass)) +
geom_point() +
scale_y_sqrt() +
scale_x_log10()
ex0327scale layers described in Problem 4 hints.filter command from dplyr to make a version of the data set that only contains cases where DurationOfVisit < 31The (hidden) R chunk below defines a function named reg.sim2 that samples responses from a SLR model given a vector of explanatory values \(x\). You will use the following SLR model to generate your set of responses: \[
Y_{i} = 20 + 1x_{i} + \epsilon_{i} \ \ \ \ \epsilon_{i} \sim N(0,2)
\] so that \(\beta_{0}=20, \beta_{1}=1\) and \(\sigma=2\).
We start by generating a sample of 1000 explanatory variable values. Suppose the explanatory variable \(x\) is equally (uniformly) distribution between 1 and 10. Generate \(x\) and view its distribution (use whatever seed value you like):
set.seed(7)
x <- seq(from=1,to=10,length=1000)
hist(x)
Next, for each of the 1000 \(x_i\)’s that you just created, use the reg.sim2 function to generate 1000 responses \(y_i\) from the population model described at the start of this problem:
reg.sim2(x, beta0=20, beta1=1, sigma=2, grph=T)
## $b0
## [1] 20.23249
##
## $b1
## [1] 0.9588374
##
## $SE.b0
## [1] 0.1451941
##
## $SE.b1
## [1] 0.0238654
The R output gives the estimated slope and intercept (\(\hat{\beta}_{1}, \hat{\beta}_{0}\)), a scatterplot of the data (along with the sample and population regression lines), and histograms/QQnormal plots for the responses \(y_i\) and the residuals \(r_{i}\). Use the histograms and QQ normal plots to answer the following questions:
Are the sampled \(y\)s normally distributed? If not, describe the general shape of their distribution.
Are the residuals normally distributed? If not, describe the general shape of their distribution.
rgamma(1000, 1, 1/2).rnorm(1000, 10, 2).